Reduce Validation and Proposal message relaying #3412

gregtatcam · 2020-05-21T19:57:41Z

The motivation for the reduce relay is to reduce CPU and bandwidth cost and improve link latency.
Every node independently goes through the peer selection rounds at some random intervals.
MAX_PEERS are randomly selected from the pool of directly connected peers with the Validation and Proposal message count from a validator in {MESSAGE_LOW_THRESHOLD,MESSAGE_UPPER_THRESHOLD}. The peers are chosen to be the source of Validation and Proposal messages from this validator. Protocol message is sent to the rest of directly connected peers instructing them to squelch (stop relaying) the message for some random period of time in {MAX_UNSQUELCH_EXPIRE,MIN_UNSQUELCH_EXPIRE}.
When the squelch expires or when a new peer connects to the node, another peer selection round starts.
When a peer, which is selected to be the source of the messages for the given validator disconnects, or is idled then unsquelch message is sent to all squelched peers and another peer selection round starts.

pwang200

I had a quick first pass and had some minor comments. I will get better understanding of the overlay relay code, then finish the review later.

src/ripple/overlay/Message.h

src/ripple/overlay/Squelch.h

src/ripple/overlay/SquelchCommon.h

src/ripple/overlay/Slot.h

src/ripple/overlay/impl/PeerImp.cpp

src/ripple/overlay/impl/OverlayImpl.h

src/ripple/overlay/impl/PeerImp.cpp

src/ripple/overlay/Slot.h

seelabs

Some minor feedback on the structure of the PR:

It's easier for reviewers if the commits are organized into logical units. This PR has 70 commits, and those commits reflect the way the feature was developed. As a reviewer, I'm less interested in the timeline of how the feature was developed; I'm more interested in "here's a patch with all the XXX changes".

The biggest issue in this commit is making the Slots and Squlech classes generic (especially Slots). For non-test code they do not need to be generic and do not need callback functions; It would be a HUGE win for code understandably if we could somehow test this without making those classes generic.

src/ripple/overlay/Message.h

src/ripple/overlay/impl/OverlayImpl.cpp

src/ripple/overlay/Slot.h

src/ripple/overlay/Squelch.h

seelabs · 2020-06-05T02:14:17Z

src/ripple/overlay/Slot.h

+ * doesn't count messages in this state but a received message
+ * may transition Slot to Counting state.
+ */
+template <typename Peer, typename clock_type>


I understand the motivation to make this class generic - it help with testing. However, making this generic makes the code much harder to navigate and understand. We really need to think hard if there's another way to test the code that doesn't necessitate making this class generic.

This is not just a nit to me. Making this class use the concrete peer type, getting rid of all the generic callback functions, and getting rid of the comparator parameter would be a huge improvement in code understandably.

I do know we did something similar in the consensus code. Please think hard if there's another, better solution to the testing problem;

We can eliminate the template all together. The idea is to introduce SquelchHandler abstract class, which declares the squelch/unsquelch handlers. Then implementation in Overlay and UnitTest encapsulate specific object Peer type and implement the handlers. We can provide on a fly handler for UnitTest where we want to have some flexibility. The only disadvantage that I see so far of this approach is that instead of storing Peer in the Slot we are storing SquelchHandler so we have to instantiate this object for every Peer and validator. So the additional memory footprint is nValidators * nPeers. The concept is explained in this code example: https://gist.github.com/gregtatcam/4c2a680ab3e3cd5095f130bfb329e19f

First, thanks for diving into this.

There are a couple of ways we could try to solve this:

Templates and callbacks (the current approach). My biggest objection is ability to navigate and understand the code.

New base class for Overlay. I haven't coded this up, but I suspect there isn't enough shared implementation between the test and non-test code. So we'll end up testing something very different from the production code.

variant based wrapper. I like the concreteness of this, but the test code requires several wrapped types. This solution works best if there was only one (or very few) wrapped types.

type erased wrapper (the approach in your patch). This requires an extra allocation, but it is an improvement over the template based approach from a code navigability POV. We may be able to remove the allocation with small object optimization techniques.

Write a test framework that can handle the Overlays and Peers. This by far gives the cleanest, most understandable code. But we (1) don't have that framework and (2) unit tests are still valuable.

Given the above, I vote that we go forward with your proposed patch. However, you might want to wait to hear from other reviewers to see if they have other ideas on this issue as well.

We can implement Peer class in unit tests with most of the methods simply returning some value. This is not going to be the test framework or a step towards implementing the test framework but a way to avoid the extra allocation in (4). We can then pass std::weak_ptr<Peer> to all applicable slot methods as we do today with Peer being a concrete type so no template is required. We will have abstract class SquelchHandler with squelch and unsquelch methods. The class is going to be implemented in Overlay and UnitTest. Reference to the base class is going to be passed to Slots and Slot constructors and squelch and unsquelch called as needed. UnitTest can overwrite the handling on a fly providing flexibility of different handlers as needed. Does this work?

seelabs

Nice job getting rid of the templates and callbacks.

I left another set of minor comments, but I suspect that's the last batch from me.

src/ripple/app/misc/NetworkOPs.cpp

src/ripple/overlay/Slot.h

seelabs · 2020-06-16T16:56:26Z

src/ripple/overlay/Slot.h

+ * unsquelched, disconnected peer, or idling peer may transition Slot to
+ * Counting state.
+ */
+template <typename clock_type>


Now that this class is simplified to take a single template parameter, consider moving the implementation to a .cpp file and explicitly instantiating the clocks we need for linking.

Wouldn't we have to move ManualClock from the unit test to the Slot's implementation in this case?

We'd have to move the ManualClock to a common header file, and we normally would not include a test file like that in the main code (tho I think we have a good reason in this case). But let's not worry about that now, we can do that later in another PR if we want.

I'd rather not have a test file in the main code. I think a better way is to get rid of the clock template parameter altogether, in another PR.

src/ripple/overlay/Message.h

src/ripple/overlay/Slot.h

src/test/overlay/reduce_relay_test.cpp

src/ripple/overlay/Slot.h

seelabs

👍 Let a couple minor comments that do need to be addressed (a typo and the order of includes), but those changes are so trivial I'm OK approving it now.

src/ripple/overlay/Slot.h

src/ripple/overlay/Squelch.h

pwang200

I tried the code with several test cases. They worked fine.

seelabs

Reviewed the latest patch, left some comments.

src/ripple/overlay/Slot.h

src/test/overlay/reduce_relay_test.cpp

src/ripple/app/misc/NetworkOPs.h

src/ripple/core/impl/Config.cpp

src/ripple/overlay/Slot.h

src/ripple/overlay/impl/PeerImp.cpp

src/ripple/overlay/impl/OverlayImpl.cpp

pwang200

Let me confirm my understanding.

we count and only count for relayed messages, no matter if the messages are trusted or not.
for the same message, all the peers forwarded the message to the local node are recorded by the hashRouter.
there are two code paths that peers are counted by the slots logic:
-- if they are recorded by the hashRouter before the relayed timestamp is set, they are counted right after the local node try to relay the message. This also works for the case that the hashRouter has recorded all the peers (i.e. the message won't actually be sent.)
-- (note: thread safety (for the atomic operation between the relayed timestamp and the set of peers) is protected by the mutex in hashRouter)
-- else (after the relayed timestamp is set), the peers are counted when their messages are received.
-- (note: for both of the cases, peers are counted only if the message is relayed)
the container (peersWithMessage_) in slots is only for preventing over counting.
if a peer forwards the message too late (8 seconds after the relayed timestamp was set), then the peer is not counted, which is OK.

src/ripple/overlay/Slot.h

pwang200 · 2020-07-22T18:05:41Z

src/ripple/overlay/impl/OverlayImpl.h

+     * the message has been  relayed.
+     * @param key Unique message's key
+     * @param validator Validator's public key
+     * @param id Peers id


should be peers

pwang200 · 2020-07-22T21:27:55Z

src/ripple/app/misc/NetworkOPs.h

+     * have already seen the messge; i.e.
+     * the message has been received from
+     * these peers and added to the hash
+     * router. */


Is "Return the set of peers from which the local node has received the message." accurate?

It is existing behavior, but could you please put a note saying this function will relay valid proposal messages?

pwang200 · 2020-07-23T04:39:05Z

src/ripple/overlay/Slot.h

+
+template <typename clock_type>
+void
+Slots<clock_type>::unsquelch(id_t const id)


What is the benefit of this function? I see this function is called by OverlayImpl::deletePeer(Peer::id_t const id). I feel it might be easier to understand if deletePeer(id, true); was called directly.

gregtatcam

Let me confirm my understanding.

we count and only count for relayed messages, no matter if the messages are trusted or not.

for the same message, all the peers forwarded the message to the local node are recorded by the hashRouter.

there are two code paths that peers are counted by the slots logic:
-- if they are recorded by the hashRouter before the relayed timestamp is set, they are counted right after the local node try to relay the message. This also works for the case that the hashRouter has recorded all the peers (i.e. the message won't actually be sent.)
-- (note: thread safety (for the atomic operation between the relayed timestamp and the set of peers) is protected by the mutex in hashRouter)
-- else (after the relayed timestamp is set), the peers are counted when their messages are received.
-- (note: for both of the cases, peers are counted only if the message is relayed)

the container (peersWithMessage_) in slots is only for preventing over counting.

if a peer forwards the message too late (8 seconds after the relayed timestamp was set), then the peer is not counted, which is OK.

Yes, this is correct understanding on all points.

src/ripple/overlay/Slot.h

gregtatcam · 2020-06-25T18:53:37Z

src/ripple/overlay/Slot.h

+ * unsquelched, disconnected peer, or idling peer may transition Slot to
+ * Counting state.
+ */
+template <typename clock_type>


I'd rather not have a test file in the main code. I think a better way is to get rid of the clock template parameter altogether, in another PR.

src/ripple/overlay/Slot.h

src/ripple/overlay/Squelch.h

src/test/overlay/reduce_relay_test.cpp

seelabs

👍 Looks good! (left one minor comment about a journal param, but marking this approved now).

pwang200

Looks great!

pwang200

Still good.

seelabs

👍

nbougalis

I don't see anything obviously wrong with this PR, but I think the code is a bit hairy. This isn't a criticism of your changes @gregtatcam; the code was already hairy when you started working on this. Separately from this PR, we should try to revisit how we do some things, especially the HashRouter.

nbougalis · 2020-05-22T03:41:01Z

src/ripple/overlay/Message.h

+    Message(
+        ::google::protobuf::Message const& message,
+        int type,
+        boost::optional<PublicKey> validator = {});


Please document this parameter in the comment above.

It is documented. This is what it currently states:
@param validator Public Key of the source validator for Validation or Proposal message. Used to check if the message should be squelched.

nbougalis · 2020-08-18T08:57:17Z

src/ripple/overlay/impl/PeerImp.cpp

+{
+    if (!m->has_validatorpubkey())
+    {
+        JLOG(p_journal_.debug()) << "onMessage: TMSquelch, missing public key";


We shouldn't be logging such malformed messages; rather we should just charge the peer for sending us bad data and we should drop the message on the floor.

nbougalis · 2020-08-18T09:01:09Z

src/ripple/core/Config.h

@@ -179,6 +179,12 @@ class Config : public BasicConfig
    // Thread pool configuration
    std::size_t WORKERS = 0;

+    // Reduce-relay


Please add a comment that these parameters are experimental.

nbougalis · 2020-08-18T09:34:41Z

src/ripple/overlay/Overlay.h

-    /** Relay a validation. */
-    virtual void
-    relay(protocol::TMValidation& m, uint256 const& uid) = 0;
+    /** Relay a proposal. Return set


Prefer doxygen format:

/** Relay a proposal. @param m the serialized proposal @param uid the id used to identify this proposal @param validator The pubkey of the validator that issued this proposal @return the set of peers which have already sent us this proposal */

nbougalis · 2020-08-18T09:40:28Z

src/ripple/overlay/impl/OverlayImpl.cpp

@@ -99,6 +99,7 @@ OverlayImpl::Timer::on_timer(error_code ec)
    overlay_.m_peerFinder->once_per_second();
    overlay_.sendEndpoints();
    overlay_.autoConnect();
+    overlay_.deleteIdlePeers();


Does this really need to run every second?

Changed to once every 4 seconds. A peer is considered idle if no message is received in 8 seconds.

nbougalis · 2020-08-18T09:48:28Z

src/ripple/app/misc/HashRouter.cpp

@@ -49,12 +49,19 @@ HashRouter::addSuppression(uint256 const& key)

 bool
 HashRouter::addSuppressionPeer(uint256 const& key, PeerShortID peer)
+{
+    auto [added, _] = addSuppressionPeerWithStatus(key, peer);


I'd simply: return addSuppressionPeerWithStatus(key, peer).first;

nbougalis · 2020-08-18T10:09:56Z

src/ripple/app/misc/HashRouter.cpp

+}
+
+std::pair<bool, std::optional<Stopwatch::time_point>>
+HashRouter::addSuppressionPeerWithStatus(const uint256& key, PeerShortID peer)


The HashRouter API was already confusing enough. It's now become a little too confusing. I don't have an improvement to offer, but we need to think about this.

A server may receive multiple copies of a message from its directly connected peers. We select three peers as the source of validation and proposal messages from the validator and "squelch" other peers. This change reduces CPU and bandwidth costs and improves link latency.

Charge for bad TMSquelch Update comments

Add to manual tests

carlhua assigned nbougalis and pwang200 May 21, 2020

gregtatcam requested a review from seelabs May 21, 2020 21:10

nbougalis self-requested a review May 22, 2020 03:43

pwang200 reviewed May 28, 2020

View reviewed changes

src/ripple/overlay/Message.h Show resolved Hide resolved

src/ripple/overlay/Squelch.h Outdated Show resolved Hide resolved

src/ripple/overlay/SquelchCommon.h Outdated Show resolved Hide resolved

src/ripple/overlay/Slot.h Outdated Show resolved Hide resolved

pwang200 reviewed Jun 1, 2020

View reviewed changes

src/ripple/overlay/impl/PeerImp.cpp Outdated Show resolved Hide resolved

pwang200 reviewed Jun 1, 2020

View reviewed changes

src/ripple/overlay/impl/OverlayImpl.h Outdated Show resolved Hide resolved

pwang200 reviewed Jun 1, 2020

View reviewed changes

src/ripple/overlay/impl/PeerImp.cpp Outdated Show resolved Hide resolved

pwang200 reviewed Jun 2, 2020

View reviewed changes

src/ripple/overlay/Slot.h Outdated Show resolved Hide resolved

seelabs suggested changes Jun 5, 2020

View reviewed changes

seelabs reviewed Jun 16, 2020

View reviewed changes

seelabs reviewed Jun 18, 2020

View reviewed changes

src/ripple/overlay/Slot.h Outdated Show resolved Hide resolved

seelabs reviewed Jun 18, 2020

View reviewed changes

src/test/overlay/reduce_relay_test.cpp Outdated Show resolved Hide resolved

seelabs reviewed Jun 18, 2020

View reviewed changes

src/ripple/overlay/Slot.h Outdated Show resolved Hide resolved

seelabs approved these changes Jun 18, 2020

View reviewed changes

pwang200 reviewed Jun 19, 2020

View reviewed changes

src/ripple/overlay/Slot.h Outdated Show resolved Hide resolved

src/ripple/overlay/Slot.h Outdated Show resolved Hide resolved

src/ripple/overlay/Slot.h Outdated Show resolved Hide resolved

src/ripple/overlay/Squelch.h Outdated Show resolved Hide resolved

pwang200 approved these changes Jun 22, 2020

View reviewed changes

manojsdoshi added the Ready to merge *PR author* thinks it's ready to merge. Has passed code review. Perf sign-off may still be required. label Jun 25, 2020

manojsdoshi mentioned this pull request Jun 25, 2020

Proposed 1.6.0-b8 #3459

Closed

manojsdoshi closed this Jun 30, 2020

manojsdoshi reopened this Jun 30, 2020

carlhua requested review from pwang200 and seelabs July 8, 2020 15:36

seelabs removed the Ready to merge *PR author* thinks it's ready to merge. Has passed code review. Perf sign-off may still be required. label Jul 22, 2020

seelabs reviewed Jul 22, 2020

View reviewed changes

pwang200 reviewed Jul 23, 2020

View reviewed changes

gregtatcam commented Jul 23, 2020

View reviewed changes

gregtatcam requested review from pwang200 and seelabs August 3, 2020 19:28

seelabs approved these changes Aug 4, 2020

View reviewed changes

pwang200 approved these changes Aug 6, 2020

View reviewed changes

gregtatcam force-pushed the reduce-relay1 branch from 3577885 to 49256bd Compare August 6, 2020 21:58

gregtatcam requested review from pwang200 and seelabs August 6, 2020 22:10

pwang200 approved these changes Aug 6, 2020

View reviewed changes

seelabs approved these changes Aug 17, 2020

View reviewed changes

nbougalis approved these changes Aug 18, 2020

View reviewed changes

carlhua added the Ready to merge *PR author* thinks it's ready to merge. Has passed code review. Perf sign-off may still be required. label Aug 24, 2020

gregtatcam added 2 commits August 25, 2020 10:05

Check idle peers every 4 seconds

8ec800f

Charge for bad TMSquelch Update comments

gregtatcam force-pushed the reduce-relay1 branch from 216c6e8 to 8ec800f Compare August 25, 2020 14:18

Fix unused variables errors

e3e26f3

gregtatcam requested a review from nbougalis August 26, 2020 17:32

gregtatcam added 4 commits August 27, 2020 18:28

Improve Travis build/unit-test on Windows

5d1f56d

Make random simulation test as optional

35fbbfe

Roll back travis script change

70675cb

Split random test into separate manual suite

c679547

Add to manual tests

nbougalis mentioned this pull request Sep 1, 2020

Proposed 1.7.0-b1 #3599

Merged

nbougalis closed this in #3599 Sep 8, 2020

Reduce Validation and Proposal message relaying #3412

Reduce Validation and Proposal message relaying #3412

Conversation

gregtatcam commented May 21, 2020

pwang200 left a comment

Choose a reason for hiding this comment

seelabs left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seelabs left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seelabs left a comment

Choose a reason for hiding this comment

pwang200 left a comment

Choose a reason for hiding this comment

seelabs left a comment

Choose a reason for hiding this comment

pwang200 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gregtatcam left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seelabs left a comment

Choose a reason for hiding this comment

pwang200 left a comment

Choose a reason for hiding this comment

pwang200 left a comment

Choose a reason for hiding this comment

seelabs left a comment

Choose a reason for hiding this comment

nbougalis left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment